-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38030][SQL] Canonicalization should not remove nullability of AttributeReference dataType #35332
Conversation
Can one of the admins verify this patch? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Outdated
Show resolved
Hide resolved
assert(cast.resolved) | ||
// canonicalization should not converted resolved cast to unresolved | ||
assert(cast.canonicalized.resolved) | ||
assert(cast.canonicalized.dataType == structType.asNullable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use triple-equals here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
TryCast(attr, structType))) { | ||
assert(cast.resolved) | ||
// canonicalization should not converted resolved cast to unresolved | ||
assert(cast.canonicalized.resolved) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we be asserting on cast.canonicalized.output
as well? this is how the original issue was detected, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original issue was detected because of the following plan UnionExec -> ProjectExec --> Alias --> Cast --> AttributeReference
and a call to .output
on UnionExec
which in turn calls ProjectExec.output.nullable
. I can recreate this chain here, but not sure if thats useful. The root issue was that a resolved node was being converted to unresolved which we are testing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah I think the current test is probably sufficient.
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
Outdated
Show resolved
Hide resolved
// When this cast involves TimeZone, it's only resolved if the timeZoneId is set; | ||
// Otherwise behave like Expression.resolved. | ||
override lazy val resolved: Boolean = | ||
childrenResolved && checkInputDataTypes().isSuccess && (!needsTimeZone || timeZoneId.isDefined) | ||
|
||
override lazy val preCanonicalized: Expression = { | ||
val basic = withNewChildren(Seq(child.preCanonicalized)).asInstanceOf[CastBase] | ||
.withDataType(dataType.asNullable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a comment to say that this matches the canonicaliztion of AttributeReference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that other kind of child's canonicalized format doesn't follow the canonicaliztion of AttributeReference on nullability change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I can't recall why AttributeReference
needs to remove nullability when canonicalization. Maybe we should change AttributeReference
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why AttributeReference
needs to remove nullability either, however I don't have much domain knowledge here, so I opted for a safer change to only fix Cast
. There can be other Cast
like expressions which do not handle removing nullability of their dataType params.
I had made a test run a few days ago for preserving nullability in AttributeReference
during canonicalization and the existing tests passed. I don't think making this change should lead to failures, but it may lead to some lost optimization? (From what I can see, Canonicalization is used to compare to query plans, and with .asNullable
, it is more likely that plans will match)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the other hand, I think it's safer to not remove nullability when comparing query plans. Maybe two plans are truly different because the nullability is different and the data is also different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I think its probably more deviation in behavior, but safer overall. I will update this PR soon.
…AttributeReference dataType
5cf60ce
to
6b77f66
Compare
@cloud-fan @viirya @xkrogen Thanks for the reviews! I have updated the PR with the new approach of preserving nullability during canonicalization of |
Hi folks, can someone take another look at the PR? |
LGTM, also cc @sigmod |
thanks, merging to master! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm too
…y of AttributeReference dataType This is a backport of #35332 to branch 3.2 ### What changes were proposed in this pull request? Canonicalization of AttributeReference should not remove nullability information of its dataType. ### Why are the changes needed? SPARK-38030 lists an issue where canonicalization of cast resulted in an unresolved expression, thus causing query failure. The issue was that the child AttributeReference's dataType was converted to nullable during canonicalization and hence the Cast's `checkInputDataTypes` fails. Although the exact repro listed in SPARK-38030 no longer works in branch-3.2 due to an unrelated change (details in the JIRA), some other codepaths which depend on canonicalized representations can trigger the same issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test to ensure that canonicalization preserves nullability of AttributeReference and does not result in an unresolved cast Closes #35446 from shardulm94/SPARK-38030-3.2. Authored-by: Shardul Mahadik <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…y of AttributeReference dataType This is a backport of #35332 to branch 3.1 ### What changes were proposed in this pull request? Canonicalization of AttributeReference should not remove nullability information of its dataType. ### Why are the changes needed? SPARK-38030 lists an issue where canonicalization of cast resulted in an unresolved expression, thus causing query failure. The issue was that the child AttributeReference's dataType was converted to nullable during canonicalization and hence the Cast's `checkInputDataTypes` fails. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test to ensure that canonicalization preserves nullability of AttributeReference and does not result in an unresolved cast. Also added a test to ensure that the issue observed in SPARK-38030 (interaction of this bug with AQE) is fixed. This test/repro only works on 3.1 because the code which triggers access on an unresolved object is [lazy](https://github.com/apache/spark/blob/7e5c3b216431b6a5e9a0786bf7cded694228cdee/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L132) in 3.2+ and hence does not trigger the issue in 3.2+. Closes #35444 from shardulm94/SPARK-38030-3.1. Authored-by: Shardul Mahadik <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…y of AttributeReference dataType This is a backport of apache#35332 to branch 3.2 ### What changes were proposed in this pull request? Canonicalization of AttributeReference should not remove nullability information of its dataType. ### Why are the changes needed? SPARK-38030 lists an issue where canonicalization of cast resulted in an unresolved expression, thus causing query failure. The issue was that the child AttributeReference's dataType was converted to nullable during canonicalization and hence the Cast's `checkInputDataTypes` fails. Although the exact repro listed in SPARK-38030 no longer works in branch-3.2 due to an unrelated change (details in the JIRA), some other codepaths which depend on canonicalized representations can trigger the same issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test to ensure that canonicalization preserves nullability of AttributeReference and does not result in an unresolved cast Closes apache#35446 from shardulm94/SPARK-38030-3.2. Authored-by: Shardul Mahadik <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit d62735d)
What changes were proposed in this pull request?
Canonicalization of AttributeReference should not remove nullability information of its dataType.
Why are the changes needed?
SPARK-38030 lists an issue where canonicalization of cast resulted in an unresolved expression, thus causing query failure. The issue was that the child AttributeReference's dataType was converted to nullable during canonicalization and hence the Cast's
checkInputDataTypes
fails. Although the exact repro listed in SPARK-38030 no longer works in master due to an unrelated change (details in the JIRA), some other codepaths which depend on canonicalized representations can trigger the same issue.Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit test to ensure that canonicalization preserves nullability of AttributeReference and does not result in an unresolved cast